Jupyter Notebook Tutorial

A few of the referenced functions (e.g., AnovaRM, Markdown calling Python variables, and anything involving live MATLAB or R code) will not run on the Binder instance of Jupyter. However, most everything else should.

Getting set up

Install Anaconda

  • I'd recommend getting the latest version of Python (version 3.6 at time of writing).
  • You can also use conda to set up separate Python 2 and Python 3 environments:

    # Install the full Anaconda distribution for both Python 2 and 3
    conda create -n py36 python=3.6 anaconda
    conda create -n py27 python=2.7 anaconda

    # Register the py27 kernel (no need for "source" on Windows)
    source activate py27
    ipython kernel install

    # Same for py36, and install jupyterhub in the py36 env
    source activate py36
    ipython kernel install
    pip install jupyterhub
    

Troubleshooting

If your Jupyter instance of Python isn't seeing your installed packages, it's probably pointing to the wrong Python or the wrong path. First, diagnose the problem in Jupyter by running !which python (on Mac/Linux) and import sys; sys.path. The first command shows which Python the shell finds (using the terminal), while the second shows where Python looks for packages. The answers should match what you get from your own terminal; if they differ between Jupyter and your terminal, then you've found your problem.

You should be able to fix either problem by activating your chosen environment, and running python -m ipykernel install --user
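The same diagnosis can be sketched entirely in Python, which also works on Windows where which is unavailable. sys.executable reports the interpreter the kernel is actually running, which can differ from what !which python finds on the shell's PATH. The helper name describe_interpreter is made up for illustration.

```python
import sys


def describe_interpreter():
    """Collect the facts needed to compare a Jupyter kernel with a terminal Python."""
    return {
        "executable": sys.executable,   # interpreter running this code
        "prefix": sys.prefix,           # root of the active environment
        "search_path": list(sys.path),  # where imports are looked up, in order
    }


info = describe_interpreter()
print("Python executable:", info["executable"])
print("Environment prefix:", info["prefix"])
for entry in info["search_path"]:
    print("  import path:", entry)
```

Run it once in a notebook cell and once in a terminal Python session; any difference in the executable or prefix points to a kernel registered from the wrong environment.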

Install necessary packages

  • pip install insert_package_name_here
  • You might have to preface that with sudo if you're on a Mac.
  • Alternatively, use conda install insert_package_name_here if you run into issues with pip
  • conda install -c conda-forge insert_package_name_here is also an option for certain packages.

Optional packages

  • You're probably going to want the following packages (though some may already be installed via Anaconda):
    • bokeh
    • holoviews
    • jupyter
    • jupyter_contrib_nbextensions
      • Run the following command for this after install: jupyter contrib nbextension install --user
    • jupyterthemes
      • Use if you're not happy with the default aesthetics of the notebook
      • Run at terminal for (most of) my aesthetic setup: jt -t grade3 -fs 12 -tfs 12 -nfs 115 -cellw 88% -T
      • If you don't like it, you can always go back to the default: jt -r
    • matplotlib
    • nbopen
      • Used to associate .ipynb files with Jupyter in your file manager
        • Linux/BSD: python -m nbopen.install_xdg
        • Windows: python -m nbopen.install_win
        • Mac: Clone the repository and run ./osx-install.sh
    • numpy
    • pandas
    • pivottablejs
    • prettypandas
    • matlab_kernel and pymatbridge
      • For using MATLAB
      • If pymatbridge doesn't work, go to matlabroot\extern\engines\python and run python setup.py install
    • rpy2
      • For using R
      • More instructions in relevant section below
    • scipy
    • seaborn
    • statsmodels
    • wes
      • Optional package for Wes Anderson-style color palettes

Optional Psychopy environment

  • You can even create a Psychopy environment:
    • Get the appropriate .yml file from Gary Lupyan's github
    • Save it somewhere as, say, psychopy.yml
    • Go to that location and run: conda-env create -f psychopy.yml -n psychopy
    • After that you can always use it via source activate psychopy (no need for source on Windows)

Open Jupyter notebook from terminal or cmd

  • jupyter lab or jupyter notebook
    • Make sure to cd into the directory you want to run it in (or at least a directory higher than the one you want; you can't go higher from within the notebook instance, nor can you go laterally!)
    • You can switch between views by navigating to http://localhost:8888/lab or http://localhost:8888/tree respectively.

nbextensions

  • Enable your favorite nbextensions (below I've listed mine).
    • Tree Filter
    • table_beautifier
    • Variable inspector
    • Codefolding
    • Chrome clipboard
    • Codefolding in editor
    • contrib_nbextensions_help_item
    • nbextensions dashboard tab
    • Collapsible Headings, with add a control, adjust size of toggle controls, gray bracketed ellipsis, command-mode, collapse with ToC2
    • Python Markdown
      • must be trusted notebook to use properly -- enable trust at top-right of notebook
    • Table of Contents (2), with auto-number, sidebar, widen display, display toc as navigation menu, move title and menu left instead of center, and collapse
      • can export notebook to HTML with table of contents with: jupyter nbconvert --to html_toc FILENAME.ipynb
      • if you get an error that says "No such module as 'pre_pymarkdown'", then you will need to do the following:
        • find "pre_pymarkdown.py" on your computer and add it to the PYTHONPATH environment variable
        • add the following to your "jupyter_nbconvert_config.py" file:
          c = get_config()
          c.Exporter.preprocessors = ['pre_pymarkdown.PyMarkdownPreprocessor']
          

Markdown Tutorial

Double-click on the cells to see how everything was written!

Headings

Headings are made with preceding "#" signs. <h1> is #, <h2> is ##, etc.

White space

Force new blank lines with <br>.

Emphasis

Italics are made by surrounding a word or phrase with asterisks, or with underscores, like so.

Bold words are made by surrounding a word or phrase with 2 asterisks on each end.

You can make a phrase both bold and italic by combining the above!

Unordered Lists

  • Dashes make bullets
    • And tabbing first makes a sub-bullet
      • You can also just use a single space instead of a tab character; just be consistent.

Ordered Lists

  1. You can make ordered lists with a number followed by a dot.
  2. Here's another point.

Blockquotes

Put a ">" before a line to turn it into a blockquote.

Code

Unhighlighted code goes between backticks: this is code

And you can define blocks of code by sandwiching them between 3 backticks on either end (you can even define syntax highlighting!)

x = [1, 2, 3]
for i in x:
    print(i)

The text of a hyperlink goes in square brackets, with the link itself going in parentheses immediately after (no whitespace allowed between the closing bracket and the opening parenthesis)!

Images are set up just like hyperlinks, but with an exclamation point in front. The writing in square brackets serves as the alt-text for the image.

Yale Psychology Department

Embed HTML, including video

In [2]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/HW29067qVWk" frameborder="0" allowfullscreen></iframe>

This works for live websites, too!

In [3]:
# %%HTML
# <iframe src="https://fiddle.jshell.net/rahonavis75/ed4486f9/show/" width="800" height="500">

LaTeX

Sandwich your LaTeX between dollar signs: single dollar signs for inline math, double dollar signs for a displayed equation.
$$ \begin{equation*} \left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) \end{equation*} $$

If you wanted, you could literally write your paper in a Jupyter notebook! To do this, you would combine your analysis script with your manuscript by feeding the results of the former directly into the latter. Here's an example where I feed a variable into a Markdown cell.

In [4]:
foo = 100

foo is {{foo}}
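The {{foo}} syntax depends on the Python Markdown nbextension and a trusted notebook. As a hedged fallback for environments without it, the same text can be built in a code cell and rendered there. This sketch assumes IPython may or may not be installed, and markdown_source is an illustrative name.

```python
foo = 100

# Build the Markdown text with ordinary string formatting
markdown_source = "foo is **{}**".format(foo)

try:
    # Inside a notebook this renders the string as formatted Markdown
    from IPython.display import Markdown, display
    display(Markdown(markdown_source))
except ImportError:
    # Outside IPython, fall back to printing the raw Markdown
    print(markdown_source)
```

The trade-off is that the output lives in a code cell's output area rather than in a Markdown cell.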

Jupyter commands

Magic commands

See all commands.

In [5]:
lsmagic
Out[5]:
Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cd  %clear  %cls  %colors  %config  %connect_info  %copy  %ddir  %debug  %dhist  %dirs  %doctest_mode  %echo  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %macro  %magic  %matplotlib  %mkdir  %more  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %ren  %rep  %rerun  %reset  %reset_selective  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%cmd  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

List the variables currently in the global scope. You can also pass a type name afterwards (e.g., %who int) to show only variables of that type.

In [6]:
%who
NamespaceMagics	 foo	 get_ipython	 getsizeof	 json	 var_dic_list	 
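As a rough plain-Python approximation of what %who reports, the sketch below lists the public names in a namespace and optionally filters them by type. The real magic tracks which names were defined interactively, so treat this only as an illustration; list_user_variables and example_namespace are made-up names.

```python
def list_user_variables(namespace, of_type=None):
    """Return sorted public names in `namespace`, optionally filtered by type."""
    names = []
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skip private and bookkeeping names
        if of_type is not None and not isinstance(value, of_type):
            continue  # skip values of other types
        names.append(name)
    return sorted(names)


example_namespace = {"foo": 100, "label": "tips", "_hidden": 1, "ratio": 0.5}
print(list_user_variables(example_namespace))       # all public names
print(list_user_variables(example_namespace, int))  # integers only
```

In a notebook you could pass globals() instead of the example dictionary.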

Terminal commands

And run terminal commands directly with "!"

In [7]:
!pip list
alabaster (0.7.10)
alembic (0.9.6)
anaconda-client (1.6.5)
anaconda-navigator (1.6.9)
anaconda-project (0.8.0)
asn1crypto (0.22.0)
astroid (1.5.3)
astropy (2.0.2)
babel (2.5.0)
backports.shutil-get-terminal-size (1.0.0)
beautifulsoup4 (4.6.0)
bibtexparser (0.6.2)
bitarray (0.8.1)
bkcharts (0.2)
blaze (0.11.3)
bleach (2.1.2)
bokeh (0.12.10)
boto (2.48.0)
Bottleneck (1.2.1)
CacheControl (0.12.3)
certifi (2017.7.27.1)
cffi (1.10.0)
chardet (3.0.4)
click (6.7)
cloudpickle (0.4.0)
clyent (1.2.2)
colorama (0.3.9)
comtypes (1.1.2)
contextlib2 (0.5.5)
cryptography (2.0.3)
cssselect (1.0.1)
cycler (0.10.0)
Cython (0.26.1)
cytoolz (0.8.2)
dask (0.15.3)
datashape (0.5.4)
decorator (4.2.1)
distlib (0.2.5)
distributed (1.19.1)
docutils (0.14)
entrypoints (0.2.3)
et-xmlfile (1.0.1)
fastcache (1.0.2)
feedfinder2 (0.0.4)
feedparser (5.2.1)
filelock (2.0.12)
Flask (0.12.2)
Flask-Cors (3.0.3)
gevent (1.2.2)
glob2 (0.5)
greenlet (0.4.12)
h5py (2.7.0)
heapdict (1.0.0)
holoviews (1.9.1)
html5lib (1.0.1)
idna (2.6)
imageio (2.2.0)
imagesize (0.7.1)
ipykernel (4.8.0)
ipypublish (0.6.5)
ipython (6.2.1)
ipython-genutils (0.2.0)
ipywidgets (7.0.0)
isort (4.2.15)
itsdangerous (0.24)
jdcal (1.3)
jedi (0.11.1)
jieba3k (0.35.1)
Jinja2 (2.10)
jsonschema (2.6.0)
jupyter-client (5.2.1)
jupyter-console (5.2.0)
jupyter-contrib-core (0.3.3)
jupyter-contrib-nbextensions (0.3.3)
jupyter-core (4.4.0)
jupyter-highlight-selected-word (0.1.0)
jupyter-latex-envs (1.4.1)
jupyter-nbextensions-configurator (0.4.0)
jupyterhub (0.8.1)
jupyterlab (0.27.0)
jupyterlab-launcher (0.4.0)
jupyterthemes (0.18.2)
lazy-object-proxy (1.3.1)
lesscpy (0.12.0)
llvmlite (0.20.0)
locket (0.2.0)
lockfile (0.12.2)
lxml (4.1.1)
Mako (1.0.7)
MarkupSafe (1.0)
matlab-kernel (0.15.0)
matlabengineforpython (R2017b)
matplotlib (2.1.0)
mccabe (0.6.1)
menuinst (1.4.10)
metakernel (0.20.12)
mistune (0.8.3)
mpld3 (0.3)
mpmath (0.19)
msgpack-python (0.4.8)
multipledispatch (0.4.9)
navigator-updater (0.1.0)
nbconvert (5.3.1)
nbformat (4.4.0)
networkx (2.0)
newspaper3k (0.2.5)
nltk (3.2.4)
nose (1.3.7)
notebook (5.3.1)
numba (0.35.0+10.g143f70e)
numexpr (2.6.2)
numpy (1.13.3)
numpydoc (0.7.0)
odo (0.5.1)
olefile (0.44)
openpyxl (2.4.8)
packaging (16.8)
pamela (0.3.0)
pandas (0.20.3)
pandocfilters (1.4.2)
param (1.5.1)
parso (0.1.1)
partd (0.3.8)
path.py (10.3.1)
pathlib2 (2.3.0)
patsy (0.4.1)
pep8 (1.7.0)
pexpect (4.3.1)
pickleshare (0.7.4)
Pillow (4.2.1)
pip (9.0.1)
pivottablejs (0.8.1)
pkginfo (1.4.1)
ply (3.10)
prettypandas (0.0.3)
progress (1.3)
prompt-toolkit (1.0.15)
psutil (5.4.0)
ptyprocess (0.5.2)
py (1.4.34)
pycodestyle (2.3.1)
pycosat (0.6.2)
pycparser (2.18)
pycrypto (2.6.1)
pycurl (7.43.0)
pyflakes (1.6.0)
Pygments (2.2.0)
pylint (1.7.4)
pymarkdown (0.1.4)
pymatbridge (0.5.2)
pyodbc (4.0.17)
pyOpenSSL (17.2.0)
pyparsing (2.2.0)
PySocks (1.6.7)
pytest (3.2.1)
python-dateutil (2.6.1)
python-editor (1.0.3)
python-oauth2 (1.0.1)
pytz (2017.2)
PyWavelets (0.5.2)
pywin32 (221)
pywinpty (0.5.1)
PyYAML (3.12)
pyzmq (16.0.3)
QtAwesome (0.4.4)
qtconsole (4.3.1)
QtPy (1.3.1)
requests (2.18.4)
requests-file (1.4.2)
rope (0.10.5)
rpy2 (2.8.6)
ruamel-yaml (0.11.14)
scikit-image (0.13.0)
scikit-learn (0.19.1)
scipy (0.19.1)
seaborn (0.8)
Send2Trash (1.4.2)
setuptools (38.4.0)
simplegeneric (0.8.1)
singledispatch (3.4.0.3)
six (1.11.0)
snowballstemmer (1.2.1)
sortedcollections (0.5.3)
sortedcontainers (1.5.7)
Sphinx (1.6.3)
sphinxcontrib-websupport (1.0.1)
spyder (3.2.4)
SQLAlchemy (1.1.13)
statsmodels (0.8.0)
sympy (1.1.1)
tables (3.4.2)
tblib (1.3.2)
terminado (0.8.1)
testpath (0.3.1)
tldextract (2.2.0)
toolz (0.9.0)
tornado (4.5.3)
traitlets (4.3.2)
typing (3.6.2)
unicodecsv (0.14.1)
urllib3 (1.22)
wcwidth (0.1.7)
webencodings (0.5.1)
Werkzeug (0.12.2)
wes (0.1.5)
wheel (0.29.0)
widgetsnbextension (3.0.2)
win-inet-pton (1.0.1)
win-unicode-console (0.5)
wincertstore (0.2)
wrapt (1.10.11)
xlrd (1.1.0)
XlsxWriter (1.0.2)
xlwings (0.11.4)
xlwt (1.3.0)
zict (0.1.3)
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
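Note that !pip list reports on whichever pip the shell finds first, which may not belong to the kernel's environment (the mismatch described under Troubleshooting). A hedged alternative, assuming Python 3.8 or later, is to ask the running interpreter directly through the standard library. The function name installed_packages is illustrative.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return sorted (name, version) pairs visible to the running interpreter."""
    pairs = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata lacks a name
            pairs.add((name, dist.metadata.get("Version")))
    return sorted(pairs, key=lambda pair: pair[0].lower())


for name, version in installed_packages()[:10]:  # first ten, to keep output short
    print("{} ({})".format(name, version))
```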

Helpful shortcuts

  • While coding, SHIFT+TAB will bring up help for your current function
  • CTRL+Enter executes the current cell, keeping your focus on it
  • SHIFT+Enter executes the current cell, and moves you down to the next cell
  • ALT+Enter executes the current cell AND makes a new one below
  • ESC brings you to command mode, where you can do a number of things:
    • A makes a new cell above
    • B makes a new cell below
    • D D (that's D twice) deletes a cell
    • X cuts selected cells
    • C copies the cells
    • V pastes the cells
    • Y turns the cell into code
    • M turns the cell into Markdown
  • CTRL+SHIFT+F brings up the command palette, with all available commands
  • You can also view and edit such shortcuts from the "Help" menu at the top of the screen

Beginner data analysis with pandas

In-depth pandas tutorial

Giant pandas tutorial and attendant notes available at the links.

Setup

Allow plots in the notebook itself, and enable some helpful functions.

In [8]:
%reset -f
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # High-res graphs (rendered irrelevant by svg option below)
%config InlineBackend.print_figure_kwargs = {'bbox_inches':'tight'} # No extra white space
%config InlineBackend.figure_format = 'svg' # 'png' is default

import warnings
warnings.filterwarnings('ignore') # Because we are adults

Import example data.

In [9]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data = sns.load_dataset('tips')
data.head() # show first n entries (default is 5)
Out[9]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Data exploration

Change default graph appearance to something you like. See here for full list of available built-in styles.

In [10]:
sns.set_style("ticks") # e.g., darkgrid, whitegrid, white, etc.

## Define custom color palette
# flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
# sns.set_palette(flatui)
# sns.palplot(sns.color_palette())

Plot histograms of tips grouped by sex side by side. Make sure both have the same x and y limits.

In [11]:
data['tip'].hist(by=data['sex'], sharex=True, sharey=True)
sns.despine() # Remove top and right side of box

plt.show() # Somewhat redundant in this context, but suppresses annoying text output.

Plot overlaid histograms.

In [12]:
grouped_by_sex = data.groupby('sex')

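# You can also add several arguments below like bins=20, or density=True
# (density replaces the older normed argument, which newer matplotlib no longer accepts)
# Plotting a grouped column returns one Axes per group; both are drawn on the same plot
axes_male, axes_female = grouped_by_sex['tip'].plot(kind='hist', density=False, alpha=.5, legend=True)

# Re-label legend entries, move legend to right-middle
axes_female.legend(['Men', 'Women'], loc=(0.75, 0.5))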

sns.despine()
plt.show()

Show summary stats for the sexes.

In [13]:
grouped_by_sex['tip'].describe()
Out[13]:
count mean std min 25% 50% 75% max
sex
Male 157.0 3.089618 1.489102 1.0 2.0 3.00 3.76 10.0
Female 87.0 2.833448 1.159495 1.0 2.0 2.75 3.50 6.5

Get a subset of the data — here the tips given on Sunday at dinner time.

In [14]:
sunday_dinner_tips = data.tip[(data.day=="Sun") & (data.time=="Dinner")]
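To make the subsetting step easier to follow, here is a minimal sketch on a small made-up table (the values are invented; only the column names mirror the tips data). It shows how the boolean mask is built, and an equivalent filter written with DataFrame.query.

```python
import pandas as pd

# Small made-up table with the same column names as the tips data
toy = pd.DataFrame({
    "tip":  [1.0, 2.5, 3.0, 4.0],
    "day":  ["Sun", "Sat", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner"],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["day"] == "Sun") & (toy["time"] == "Dinner")
sunday_dinner = toy.loc[mask, "tip"]

# DataFrame.query expresses the same filter as a string
same_rows = toy.query("day == 'Sun' and time == 'Dinner'")["tip"]

print(sunday_dinner.tolist())
```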

Data processing

Add a new column showing the percentage of the total bill tipped using a lambda expression. Naturally, you can also accomplish this by defining a named function.

In [15]:
data['tip_percentage'] = data.apply(lambda row: row['tip']/row['total_bill']*100, axis=1)
data.head()
Out[15]:
total_bill tip sex smoker day time size tip_percentage
0 16.99 1.01 Female No Sun Dinner 2 5.944673
1 10.34 1.66 Male No Sun Dinner 3 16.054159
2 21.01 3.50 Male No Sun Dinner 3 16.658734
3 23.68 3.31 Male No Sun Dinner 2 13.978041
4 24.59 3.61 Female No Sun Dinner 4 14.680765
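For comparison, here is a hedged sketch of the named-function variant mentioned above, next to plain column arithmetic, which gives the same result without calling a Python function once per row. The toy numbers are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"total_bill": [20.0, 10.0, 40.0], "tip": [5.0, 1.25, 10.0]})


def tip_percentage(row):
    """Named-function equivalent of the lambda, applied to one row at a time."""
    return row["tip"] / row["total_bill"] * 100


row_wise = toy.apply(tip_percentage, axis=1)

# Column arithmetic operates on whole columns at once
vectorized = toy["tip"] / toy["total_bill"] * 100

print(vectorized.tolist())
```

On large tables the column-arithmetic form is usually much faster, since apply with axis=1 runs the function separately for every row.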

Delete that new column.

In [16]:
del data['tip_percentage']
data.head()
Out[16]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

ANOVA


Perform an ANOVA, using R-style syntax.

In [17]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = 'tip ~ sex * smoker'
lm = ols(model, data=data).fit()
table = sm.stats.anova_lm(lm, typ=2)

display(table)
sum_sq df F PR(>F)
sex 3.672183 1.0 1.912247 0.167999
smoker 0.015000 1.0 0.007811 0.929648
sex:smoker 0.639891 1.0 0.333216 0.564313
Residual 460.884051 240.0 NaN NaN

Make the table prettier and more intelligible.

In [18]:
from prettypandas import PrettyPandas 

def color_significant_green(val, alpha=0.05):
    # Green text for values below alpha, black otherwise
    color = 'green' if val < alpha else 'black'
    return 'color: %s' % color

def bold_significant(val, alpha=0.05):
    # Bold text for values below alpha, normal weight otherwise
    font_weight = 'bold' if val < alpha else 'normal'
    return 'font-weight: %s' % font_weight

t = PrettyPandas(table)
(
    t.applymap(color_significant_green, alpha=.05, subset=['PR(>F)']) # alpha is optional here, of course
    .applymap(bold_significant, alpha=.05, subset=['PR(>F)'])
    .format("{:.3f}", subset=['sum_sq', 'F', 'PR(>F)']) # show only 3 decimal places
)
Out[18]:
sum_sq df F PR(>F)
sex 3.672 1 1.912 0.168
smoker 0.015 1 0.008 0.930
sex:smoker 0.640 1 0.333 0.564
Residual 460.884 240 nan nan

T-tests

In [19]:
from numpy import sqrt
from scipy.stats import ttest_ind

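def cohens_d(t, n):
    # Converts t to d via d = 2t / sqrt(df); this is an approximation
    # that assumes the two groups are roughly equal in size
    return 2*t / sqrt(n - 2)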

# Set up empty results table
columns = ['n', 't', 'p', 'd']
index = []
results = pd.DataFrame(index=index, columns=columns)

# Get data for t-test
male_tips = data[data['sex']=='Male']['tip']
female_tips = data[data['sex']=='Female']['tip']

# Perform t-test and surrounding calculations
n = male_tips.count() + female_tips.count()
df = n-2
t, p = ttest_ind(male_tips, female_tips)
d = cohens_d(t, n)

# Add data to table
comparison = 'Male vs. Female'
results.loc[comparison] = [n, t, p, d]

# Output pretty table
r = PrettyPandas(results)
(
    r.applymap(color_significant_green, subset=['p'])
    .applymap(bold_significant, subset=['p'])
    .format("{:.3f}", subset=['t', 'p', 'd'])
)
Out[19]:
n t p d
Male vs. Female 244 1.388 0.166 0.178
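The 2t / sqrt(df) conversion used in cohens_d is an approximation that works best when the two groups are the same size, and here they are not (157 men vs. 87 women). As a hedged sketch, Cohen's d can instead be computed directly from the pooled standard deviation, or converted exactly from an equal-variance t statistic via d = t * sqrt(1/n1 + 1/n2). The function names and the sample numbers below are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d_pooled(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_variance)


def cohens_d_from_t(t, n_a, n_b):
    """Exact conversion from an equal-variance independent-samples t statistic."""
    return t * sqrt(1 / n_a + 1 / n_b)


# Made-up numbers, deliberately with unequal group sizes
group_a = [2.0, 4.0, 6.0, 8.0]
group_b = [1.0, 2.0, 3.0]
print(round(cohens_d_pooled(group_a, group_b), 3))
```

With the notebook's variables, cohens_d_from_t(t, male_tips.count(), female_tips.count()) would be the corresponding call, since ttest_ind assumes equal variances by default.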

Publication-ready statistics with Markdown

In [20]:
from IPython.display import Markdown

inequality_symbol = "="
def report_t_test(df, t, p, d, alpha=.001):
    if p < alpha:
        p = .001
        inequality_symbol = "<"
    else:
        inequality_symbol = "="
        
    T = format(t, '.2f').lstrip('0') # 2 decimal places, no leading 0
    P = format(p, '.3f').lstrip('0') 
    D = format(d, '.3f').lstrip('0') 
    DF = format(df, 'd') # integer
    
    output = ('*t*({0})={1}, *p*' + inequality_symbol + '{2}, *d*={3}').format(DF, T, P, D)
    display(Markdown(output))

report_t_test(df, t, p, d)

t(242)=1.39, p=.166, d=.178

And in plain markdown: t({{n-2}})={{format(t, '.2f').lstrip('0')}}, p{{inequality_symbol}}{{format(p, '.3f').lstrip('0')}}, d={{format(d, '.3f').lstrip('0')}}

Note that you can copy paste such outputs directly into Word with no loss of formatting!
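One caveat with the lstrip('0') approach in report_t_test: for a negative value such as -0.50 the string starts with a minus sign, so the leading zero is not removed. A small hedged helper that handles both signs might look like this (apa_number is a made-up name, not part of any library).

```python
def apa_number(value, decimals):
    """Format a number to fixed decimals and drop the zero before the decimal point."""
    text = format(value, ".{}f".format(decimals))
    if text.startswith("0."):
        return text[1:]          # "0.166" becomes ".166"
    if text.startswith("-0."):
        return "-" + text[2:]    # "-0.50" becomes "-.50"
    return text                  # values of 1 or more in magnitude are unchanged


print(apa_number(1.388, 2), apa_number(0.166, 3), apa_number(-0.5, 2))
```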

Repeated measures ANOVA

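At the time of writing, this required the development version of the statsmodels package, available here. AnovaRM has since been included in regular statsmodels releases (0.9 and later).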

In [19]:
import pandas as pd
import numpy as np
import statsmodels
from statsmodels.stats.anova import AnovaRM
statsmodels.__version__
Out[19]:
'0.8.0.dev0+91ed779'

Create simulated reaction time data for 2 levels of an independent variable.

In [20]:
N = 20
P = [1,2]

values = [998,511]
 
sub_id = [i+1 for i in range(N)]*len(P)
mus = np.concatenate([np.repeat(value, N) for value in values]).tolist()
rt = np.random.normal(mus, scale=112.0, size=N*len(P)).tolist()
iv = np.concatenate([np.array([p]*N) for p in P]).tolist()

df = pd.DataFrame({'id': sub_id, 'rt': rt, 'iv':iv})

Do the repeated measures ANOVA.

In [21]:
aovrm = AnovaRM(df, depvar='rt', subject='id', within=['iv'])
fit = aovrm.fit()
fit.summary()
Out[21]:
F Value Num DF Den DF Pr > F
iv 140.1779 1.0000 19.0000 0.0000
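As a sanity check on the idea, recall that with a single within-subject factor that has only two levels, a repeated measures ANOVA is equivalent to a paired t-test, with F equal to t squared on (1, N-1) degrees of freedom. The sketch below uses only the standard library and its own simulated data (same means and spread as above, but a different random generator, so the numbers will not match the output shown). paired_t is an illustrative name.

```python
import random
from math import sqrt
from statistics import mean, stdev


def paired_t(cond_a, cond_b):
    """Paired-samples t statistic: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


random.seed(0)  # fixed seed so the sketch is reproducible
n_subjects = 20
cond_1 = [random.gauss(998, 112) for _ in range(n_subjects)]
cond_2 = [random.gauss(511, 112) for _ in range(n_subjects)]

t_value = paired_t(cond_1, cond_2)
f_equivalent = t_value ** 2  # comparable to the F value from a two-level AnovaRM

print("t =", round(t_value, 3), " F =", round(f_equivalent, 3))
```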

Plots

Line graph with matplotlib

Plot simple line graph with sample data.

In [33]:
line_data = range(1,10)

plt.figure()
plt.title("Example Graph", size="xx-large") # can also feed font point size, like 36
plt.xlabel("X-Axis Label", size="x-large")
plt.ylabel("Y-Axis Label", size="x-large")
plt.xlim(0,10)
plt.ylim(0,10)
plt.plot(line_data, 'b*-', markersize=10, linewidth=3, label='Sample Data') # b*- means blue star marker with line
plt.tick_params(axis="both", which="major", labelsize=14)
plt.legend(loc=(0.25, 0.75), scatterpoints=1)
plt.show()

Line graph with Seaborn

Plot Anscombe's quartet.

In [59]:
import seaborn as sns
sns.set(style="ticks")

# Load the example dataset for Anscombe's quartet
anscombe = sns.load_dataset("anscombe")

# Show the results of a linear regression within each dataset
# Semi-colon suppresses the non-graph output
# (lmplot returns a FacetGrid; in seaborn >= 0.9 use height=4 instead of size=4)
grid = sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscombe,
                  col_wrap=2, ci=None, palette="muted", size=4,
                  scatter_kws={"s": 50, "alpha": 1});

# Change axis labels
grid.set(xlabel='X', ylabel='Y');

Bar graph

By default, the error bars show a 95% confidence interval.

In [36]:
ax = sns.barplot(x="day", y="total_bill", data=data, capsize=0.1)
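Seaborn estimates that confidence interval by bootstrapping. The following is a minimal standard-library sketch of a percentile bootstrap for the mean, to show the idea; seaborn's own implementation differs in detail, and the function name, parameter names, and sample values here are made up.

```python
import random
from statistics import mean


def bootstrap_ci_of_mean(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = (100 - ci) / 2 / 100
    lower_index = int(tail * n_boot)
    upper_index = int((1 - tail) * n_boot) - 1
    return boot_means[lower_index], boot_means[upper_index]


bills = [10.5, 16.0, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9]  # made-up bill totals
low, high = bootstrap_ci_of_mean(bills)
print("mean:", round(mean(bills), 2), "95% CI:", (round(low, 2), round(high, 2)))
```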

Subplots — Violin plot and beeswarm plot

Plot violin plot with overlaid beeswarm plot.

In [23]:
fig, ax = plt.subplots()

# Output to the size of A4 paper
fig.set_size_inches(11.7, 8.27)

# Overlay a swarmplot on top of a violinplot
ax = sns.violinplot(x="day", y="total_bill", data=data, inner=None)
ax = sns.swarmplot(x="day", y="total_bill", data=data, color="white")

Factor Plots

In [46]:
def set_titles(thisPlot, titleList, fontSize):
    for ax, title in zip(thisPlot.axes.flat, titleList):
        ax.set_title(title, fontsize=fontSize)

        
def set_labels(thisPlot, xLabel, yLabel, fontSize):
    thisPlot.set_xlabels(xLabel, fontsize=fontSize)
    thisPlot.set_ylabels(yLabel, fontsize=fontSize)

    
def set_xtick_labels(thisPlot, tickList, fontSize):
    thisPlot.set_xticklabels(tickList, fontsize=fontSize)

    
def set_legend(thisPlot, legendEntries, fontSize):
    # find where last graph is so we can put the legend there
    maxIndex = max(thisPlot.axes.shape) - 1
    
    # format the legend, placing it outside the axes
    thisPlot.axes[0][maxIndex].legend(bbox_to_anchor=(1.05, 1), loc=2, 
                                      fontsize=fontSize, borderaxespad=0.)
    legend = thisPlot.axes[0][maxIndex].get_legend()
    labels = legend.get_texts()
    for i, thisLabel in enumerate(labels):
        labels[i].set_text(legendEntries[i])        


# Make plots -- many of these arguments are optional
# (seaborn >= 0.9 renames factorplot to catplot, and size to height)
barPlot = sns.factorplot(x="day", y="total_bill", hue="sex", 
                         col="time", kind="bar", data=data, 
                         size=5, aspect=1, legend=False)

beeswarmPlot = sns.factorplot(x="day", y="total_bill", hue="sex", 
                              col="time", kind="swarm", dodge=True,
                              data=data, size=5, aspect=1, legend=False)

# Format them nicely!
# Axis labels
xLabel = ""  # "Day"
yLabel = "Total Bill"
set_labels(barPlot, xLabel, yLabel, 20)
set_labels(beeswarmPlot, xLabel, yLabel, 20)

# Titles
title_list = ["Lunch", "Dinner"]
titles = [x.title() for x in title_list]  # capitalize each title
set_titles(barPlot, titles, 30)
set_titles(beeswarmPlot, titles, 30)

# X axis tick labels or category labels
x_tick_labels = ["Thursday", "Friday", "Saturday", "Sunday"]
set_xtick_labels(barPlot, x_tick_labels, 15)
set_xtick_labels(beeswarmPlot, x_tick_labels, 15)

# Change legends
legendEntries = ["Male", "Female"]
set_legend(barPlot, legendEntries, 15)
set_legend(beeswarmPlot, legendEntries, 15)

# Save plots
# barPlot.savefig("barPlot.svg") # can also use other extensions, like .png
# beeswarmPlot.savefig("beePlot.svg")

Interactive Plots

Bokeh

Made using bokeh. See here for a great tutorial, and here for the attendant notebook. Code below adapted from linked code to our current dataset.

In [25]:
from bokeh.plotting import figure, output_notebook, show

this_plot = figure(width=600, height=600)

this_plot.circle(x=data['total_bill'], y=data['tip'], size=10, alpha=0.7)
output_notebook() # to output inline 
show(this_plot)
Loading BokehJS ...

Make better, more interactive plot. Let's plot a scatterplot of tip amount vs. total bill, separately for men and women.

See here for more information on styling Bokeh plots.

In [26]:
from bokeh.plotting import figure, output_notebook, show, ColumnDataSource
import bokeh.models.tools as tools

# Get relevant subsets of data
male_data = data[data['sex'] == 'Male']
female_data = data[data['sex'] == 'Female']

# Convert to format bokeh understands
source_male = ColumnDataSource(male_data)
source_female = ColumnDataSource(female_data)

# Set up figure
this_plot = figure(width=600, height=600)

this_plot.circle(source=source_male, x='total_bill', y='tip', color='teal',
         size=10, alpha=0.7, legend='Men')

this_plot.circle(source=source_female, x='total_bill', y='tip', color='darkorange',
         size=10, alpha=0.7, legend='Women')

# Set axis labels
this_plot.xaxis.axis_label = "Total Bill"
this_plot.yaxis.axis_label = "Tip Amount"

# Show information when hovering the mouse over datapoints
this_plot.add_tools(tools.HoverTool(tooltips=[('Day', '@day')])) # use @ to choose feature from dataset

# Hide all circles of a given category when clicked in legend
this_plot.legend.click_policy = 'hide' 

output_notebook() 
show(this_plot)
Loading BokehJS ...
In [28]:
import holoviews as hv
hv.extension('bokeh', 'matplotlib')

ds = hv.Dataset(data, kdims=["sex", "smoker", "total_bill"],
                      vdims=["time", "size", "day", "tip"])
In [29]:
%%output backend='bokeh'
%%output size=200
%%opts Scatter [tools=['hover']] (size=8 alpha=0.5)

kdims=["tip"]
vdims=["total_bill", "day", "time", "size"] # include "smoker" if you don't want it as drop-down choice

# Scatter plot with hover tool that includes all the things
scatter = ds.to(hv.Scatter, kdims, vdims).overlay('sex')
scatter
Out[29]:

Pivot table plots

In [23]:
from pivottablejs import pivot_ui
pivot_ui(data)
Out[23]:

Interactive slider

In [44]:
import matplotlib.pyplot as plt
from ipywidgets import interact
from numpy import pi, arange, sin

t = arange(0, 1.0, 0.01)


def pltsin(f):
    plt.plot(t, sin(2*pi*t*f))
    plt.show()
    
interact(pltsin, f=(1,10,0.1))
Out[44]:
<function __main__.pltsin>

Plotly

Plotly is another package for producing really nice and interactive graphs, but it requires signing up for an account to initialize it. After initialization you can use it online by default (which means all of your graphs get saved to the cloud for everyone to see forever) or you can use it offline (as demoed below). Examples taken or modified from here.

Setup and basic line graph

In [24]:
import plotly
# plotly.tools.set_credentials_file(username='XXX', api_key='XXX') # initialize with your credentials -- only need to do once ever.
from plotly.graph_objs import Scatter, Layout

plotly.offline.init_notebook_mode(connected=True)

plotly.offline.iplot({
    "data": [Scatter(x=[1, 2, 3, 4], y=[4, 3, 2, 1])],
    "layout": Layout(title="hello world")
})

Troubleshooting setup

When I first tried using plotly I sometimes got "IOPub data rate exceeded" errors. Here's how you fix that:

  • run jupyter notebook --generate-config to generate a clean configuration file with all parameters commented out
  • modify c.NotebookApp.iopub_data_rate_limit and c.NotebookApp.iopub_msg_rate_limit to be some absurdly large numbers
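For illustration, the edited lines in the generated jupyter_notebook_config.py might look like the sketch below. The specific values are arbitrary placeholders rather than recommendations; only the two option names come from the steps above.

```python
# jupyter_notebook_config.py (fragment)
c = get_config()  # get_config is provided by Jupyter when it loads this file

# Raise the IOPub limits far above their defaults (placeholder values)
c.NotebookApp.iopub_data_rate_limit = 1.0e10  # bytes per second
c.NotebookApp.iopub_msg_rate_limit = 1.0e10   # messages per second
```

Restart the notebook server afterwards so the new limits take effect.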

Tables

In [25]:
import plotly.offline as py
import plotly.figure_factory as ff

df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/school_earnings.csv")

table = ff.create_table(df)
py.iplot(table, filename=r'plotly\table1')  # raw string, so \t is not read as a tab character

Bar graphs

In [26]:
import plotly.offline as py
from plotly.graph_objs import *
data = [Bar(x=df.School,
            y=df.Gap)]

py.iplot(data)
In [27]:
trace_women = Bar(x=df.School,
                  y=df.Women,
                  name='Women',
                  marker=dict(color='#ffcdd2'))

trace_men = Bar(x=df.School,
                y=df.Men,
                name='Men',
                marker=dict(color='#A2D5F2'))

trace_gap = Bar(x=df.School,
                y=df.Gap,
                name='Gap',
                marker=dict(color='#59606D'))

data = [trace_women, trace_men, trace_gap]
layout = Layout(title="Average Earnings for Graduates",
                xaxis=dict(title='School'),
                yaxis=dict(title='Salary (in thousands)'))
fig = Figure(data=data, layout=layout)

py.iplot(fig)

Interactive slider

In [28]:
data = [dict(
        visible = False,
        line=dict(color='#00CED1', width=6),
        name = '𝜈 = '+str(step),
        x = np.arange(0,10,0.01),
        y = np.sin(step*np.arange(0,10,0.01))) for step in np.arange(0,5,0.1)]
data[10]['visible'] = True

steps = []
for i in range(len(data)):
    step = dict(
        method = 'restyle',
        args = ['visible', [False] * len(data)],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = 10,
    currentvalue = {"prefix": "Frequency: "},
    pad = {"t": 50},
    steps = steps
)]

layout = dict(sliders=sliders)
fig = dict(data=data, layout=layout)

py.iplot(fig)

Interactive 3D Plots

In [29]:
s = np.linspace(0, 2 * np.pi, 240)
t = np.linspace(0, np.pi, 240)
tGrid, sGrid = np.meshgrid(s, t)

r = 2 + np.sin(7 * sGrid + 5 * tGrid)  # r = 2 + sin(7s+5t)
x = r * np.cos(sGrid) * np.sin(tGrid)  # x = r*cos(s)*sin(t)
y = r * np.sin(sGrid) * np.sin(tGrid)  # y = r*sin(s)*sin(t)
z = r * np.cos(tGrid)                  # z = r*cos(t)

surface = Surface(x=x, y=y, z=z)
data = Data([surface])

layout = Layout(
    title='Parametric Plot',
    scene=Scene(
        xaxis=XAxis(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        yaxis=YAxis(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        zaxis=ZAxis(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        )
    )
)

fig = Figure(data=data, layout=layout)
py.iplot(fig)

Other plot aesthetics

Wes Anderson color palettes

You can generate these with the wes Python package.

That said, installation can be a little annoying, since you will often get an error for missing the colors.json file. If you get that error, simply download the tarball of the latest version of the package, extract colors.json and place it in the appropriate location (i.e., where the error tells you it cannot be found).

In [38]:
import wes
wes.available(show=True)

And set the palette with the following code:

In [43]:
wes.set_palette('Darjeeling')

for i in range(10):
    plt.plot(range(100), np.random.normal(i, 1, 100))

Debugging in Jupyter Notebooks

Call set_trace() at the point where you want the debugger to start. At the debugger prompt:
  • 'n' moves on to the next line
  • 'c' continues execution of the script

In [ ]:
from IPython.core.debugger import set_trace

def increment_value(a):
    a += 1
    set_trace()
    print(a)

increment_value(3)
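
On Python 3.7 and later, the built-in breakpoint() function is an alternative that needs no import; by default it opens the standard pdb debugger, where 'n' and 'c' work the same way. The sketch below guards the call behind a hypothetical DEBUG flag (not part of the original example) so that the cell can also run without stopping.

```python
DEBUG = False  # hypothetical flag; set to True to stop inside the function


def increment_value(a):
    a += 1
    if DEBUG:
        breakpoint()  # Python 3.7+; opens pdb at this line by default
    return a


print(increment_value(3))
```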

Other Python tricks

Get a function's source code

In [5]:
import inspect
import numpy as np

print(inspect.getsource(np))
"""
NumPy
=====

Provides
  1. An array object of arbitrary homogeneous items
  2. Fast mathematical operations over arrays
  3. Linear Algebra, Fourier Transforms, Random Number Generation

How to use the documentation
----------------------------
Documentation is available in two forms: docstrings provided
with the code, and a loose standing reference guide, available from
`the NumPy homepage <http://www.scipy.org>`_.

We recommend exploring the docstrings using
`IPython <http://ipython.scipy.org>`_, an advanced Python shell with
TAB-completion and introspection capabilities.  See below for further
instructions.

The docstring examples assume that `numpy` has been imported as `np`::

  >>> import numpy as np

Code snippets are indicated by three greater-than signs::

  >>> x = 42
  >>> x = x + 1

Use the built-in ``help`` function to view a function's docstring::

  >>> help(np.sort)
  ... # doctest: +SKIP

For some objects, ``np.info(obj)`` may provide additional help.  This is
particularly true if you see the line "Help on ufunc object:" at the top
of the help() page.  Ufuncs are implemented in C, not Python, for speed.
The native Python help() does not know how to view their help, but our
np.info() function does.

To search for documents containing a keyword, do::

  >>> np.lookfor('keyword')
  ... # doctest: +SKIP

General-purpose documents like a glossary and help on the basic concepts
of numpy are available under the ``doc`` sub-module::

  >>> from numpy import doc
  >>> help(doc)
  ... # doctest: +SKIP

Available subpackages
---------------------
doc
    Topical documentation on broadcasting, indexing, etc.
lib
    Basic functions used by several sub-packages.
random
    Core Random Tools
linalg
    Core Linear Algebra Tools
fft
    Core FFT routines
polynomial
    Polynomial tools
testing
    NumPy testing tools
f2py
    Fortran to Python Interface Generator.
distutils
    Enhancements to distutils with support for
    Fortran compilers support and more.

Utilities
---------
test
    Run numpy unittests
show_config
    Show numpy build configuration
dual
    Overwrite certain functions with high-performance Scipy tools
matlib
    Make everything matrices.
__version__
    NumPy version string

Viewing documentation using IPython
-----------------------------------
Start IPython with the NumPy profile (``ipython -p numpy``), which will
import `numpy` under the alias `np`.  Then, use the ``cpaste`` command to
paste examples into the shell.  To see which functions are available in
`numpy`, type ``np.<TAB>`` (where ``<TAB>`` refers to the TAB key), or use
``np.*cos*?<ENTER>`` (where ``<ENTER>`` refers to the ENTER key) to narrow
down the list.  To view the docstring for a function, use
``np.cos?<ENTER>`` (to view the docstring) and ``np.cos??<ENTER>`` (to view
the source code).

Copies vs. in-place operation
-----------------------------
Most of the functions in `numpy` return a copy of the array argument
(e.g., `np.sort`).  In-place versions of these functions are often
available as array methods, i.e. ``x = np.array([1,2,3]); x.sort()``.
Exceptions to this rule are documented.

"""
from __future__ import division, absolute_import, print_function

import sys
import warnings

from ._globals import ModuleDeprecationWarning, VisibleDeprecationWarning
from ._globals import _NoValue

# We first need to detect if we're being called as part of the numpy setup
# procedure itself in a reliable manner.
try:
    __NUMPY_SETUP__
except NameError:
    __NUMPY_SETUP__ = False

if __NUMPY_SETUP__:
    sys.stderr.write('Running from numpy source directory.\n')
else:
    try:
        from numpy.__config__ import show as show_config
    except ImportError:
        msg = """Error importing numpy: you should not try to import numpy from
        its source directory; please exit the numpy source tree, and relaunch
        your python interpreter from there."""
        raise ImportError(msg)

    from .version import git_revision as __git_revision__
    from .version import version as __version__

    from ._import_tools import PackageLoader

    def pkgload(*packages, **options):
        loader = PackageLoader(infunc=True)
        return loader(*packages, **options)

    from . import add_newdocs
    __all__ = ['add_newdocs',
               'ModuleDeprecationWarning',
               'VisibleDeprecationWarning']

    pkgload.__doc__ = PackageLoader.__call__.__doc__

    # We don't actually use this ourselves anymore, but I'm not 100% sure that
    # no-one else in the world is using it (though I hope not)
    from .testing import Tester
    test = testing.nosetester._numpy_tester().test
    bench = testing.nosetester._numpy_tester().bench

    # Allow distributors to run custom init code
    from . import _distributor_init

    from . import core
    from .core import *
    from . import compat
    from . import lib
    from .lib import *
    from . import linalg
    from . import fft
    from . import polynomial
    from . import random
    from . import ctypeslib
    from . import ma
    from . import matrixlib as _mat
    from .matrixlib import *
    from .compat import long

    # Make these accessible from numpy name-space
    # but not imported in from numpy import *
    if sys.version_info[0] >= 3:
        from builtins import bool, int, float, complex, object, str
        unicode = str
    else:
        from __builtin__ import bool, int, float, complex, object, unicode, str

    from .core import round, abs, max, min

    __all__.extend(['__version__', 'pkgload', 'PackageLoader',
               'show_config'])
    __all__.extend(core.__all__)
    __all__.extend(_mat.__all__)
    __all__.extend(lib.__all__)
    __all__.extend(['linalg', 'fft', 'random', 'ctypeslib', 'ma'])


    # Filter annoying Cython warnings that serve no good purpose.
    warnings.filterwarnings("ignore", message="numpy.dtype size changed")
    warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
    warnings.filterwarnings("ignore", message="numpy.ndarray size changed")

    # oldnumeric and numarray were removed in 1.9. In case some packages import
    # but do not use them, we define them here for backward compatibility.
    oldnumeric = 'removed'
    numarray = 'removed'
 
 
__mkl_version__ = "2018" 
 

Find where a function lives

In [8]:
inspect.getfile(np)
Out[8]:
'c:\\anaconda3\\envs\\py36\\lib\\site-packages\\numpy\\__init__.py'
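
Both calls above are given the whole numpy module, which is why the source listing is so long. The same inspect functions also accept a single function, as long as it is written in Python. The sketch below uses textwrap.dedent from the standard library purely as an example target; built-ins implemented in C, such as len, have no Python source to show.

```python
import inspect
import textwrap

# Source code of one function rather than a whole module
source = inspect.getsource(textwrap.dedent)
print(source.splitlines()[0])  # first line of the function's source

# File in which that function is defined
print(inspect.getsourcefile(textwrap.dedent))

# Functions implemented in C have no retrievable Python source
try:
    inspect.getsource(len)
except (TypeError, OSError) as err:
    print("No source available:", err)
```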

Miscellany

If you want to start digging deeper into Python, you can learn some cool things here, and here, and here.

That said, here is my favorite random snippet of Python code ever. You can swap variable values without needing any temporary variables via tuple unpacking.

In [59]:
a = "A"
b = "B"

# Swap!
a, b = b, a 

print("a = " + a)
print("b = " + b)
a = B
b = A

And extended unpacking is interesting to wrap your head around (Python 3 only).

In [60]:
a, *b, c = [1, 2, 3, 4, 5, 6]
print(a)
print(b)
print(c)
1
[2, 3, 4, 5]
6
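
A few more unpacking patterns, sketched with made-up values, show how the starred name behaves at the edges:

```python
# Split a sequence into its first item and the rest
head, *tail = [1, 2, 3, 4]
print(head, tail)  # 1 [2, 3, 4]

# Keep only the ends; the starred name may end up as an empty list
first, *middle, last = "ab"
print(first, middle, last)  # a [] b

# Rotate three values at once, again with no temporary variables
x, y, z = 1, 2, 3
x, y, z = z, x, y
print(x, y, z)  # 3 1 2
```

The starred name always receives a list, even when the right-hand side is a string or a tuple.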

List comprehensions (and their close cousins, generator expressions) are also extremely useful, allowing you to program almost as if you were writing a sentence in English.

In [28]:
# get the sum of squares of the numbers from 1 to 10
sum(i**2 for i in range(1, 11))
Out[28]:
385
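
Strictly speaking, the expression passed to sum() above has no square brackets, which makes it a generator expression: values are produced one at a time rather than stored in a list. A short sketch of the two forms side by side (the variable names are illustrative):

```python
# List comprehension: square brackets, builds the whole list in memory
squares = [i**2 for i in range(1, 11)]
print(squares)

# Generator expression: no brackets, yields values lazily to sum()
total = sum(i**2 for i in range(1, 11))
print(total)  # 385

# A condition can filter items: squares of the even numbers only
even_squares = [i**2 for i in range(1, 11) if i % 2 == 0]
print(even_squares)  # [4, 16, 36, 64, 100]
```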

Zipping lists is another one of my favorite features.

In [32]:
a = ['a', 'b', 'c']
b = [1, 2, 3]

c = zip(a, b)
print(list(c))  # convert to a list because zip returns a lazy iterator in Python 3
[('a', 1), ('b', 2), ('c', 3)]
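
Two related idioms, sketched with the same kind of lists: zipped pairs feed directly into dict(), and zip(*pairs) reverses the operation. The last part shows that a zip object in Python 3 can be consumed only once.

```python
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]

# Build a dictionary from two parallel lists
lookup = dict(zip(letters, numbers))
print(lookup)  # {'a': 1, 'b': 2, 'c': 3}

# "Unzip" a list of pairs back into separate tuples
pairs = list(zip(letters, numbers))
unzipped_letters, unzipped_numbers = zip(*pairs)
print(unzipped_letters)  # ('a', 'b', 'c')
print(unzipped_numbers)  # (1, 2, 3)

# A zip object is exhausted after one pass
zipped = zip(letters, numbers)
first_pass = list(zipped)
second_pass = list(zipped)
print(second_pass)  # []
```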

Run R code


Note that this requires running from a Python 3 instance of Jupyter (in my case, at least).

R for Jupyter installation instructions:

In theory, you should just be able to run this line and be all set, but it didn't work for me: conda install -c r r-essentials

If that doesn't work for you either, go through these steps:

  • In R (not RStudio), run the following:
    install.packages('devtools')
    devtools::install_github('IRkernel/IRkernel')
    IRkernel::installspec()  # to register the kernel in the current R installation
    
  • Make sure you have R added to your PATH (in my case, C:\Program Files\R\R-3.3.3\bin\x64)
    • Windows: you also need R_HOME (same path as above) and R_USER (just your Windows user name) added as separate environment variables
  • Install libraries like ggplot2 directly into R itself, not RStudio: install.packages('ggplot2', dependencies=TRUE)
  • Mac/Linux: Run pip install rpy2 from your command line/terminal
    • Windows: get the appropriate installation from here, and run pip install rpy2-2.8.6-cp36-cp36m-win_amd64.whl (or whatever your .whl file is called) from within the directory that contains the file.
      • You may also need to add the following two directories to your PATH: C:\Anaconda3\Library\mingw-w64\bin; C:\Anaconda3\Library\mingw-w64\lib
  • See here for further information if needed
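
Before installing rpy2, it can save time to confirm from Python that the settings listed above are actually visible to the notebook. The following sketch uses only the standard library; the function name check_r_setup is hypothetical, and it only reports what it finds rather than fixing anything.

```python
import os
import shutil


def check_r_setup():
    """Report whether R is on PATH and whether R_HOME / R_USER are set."""
    return {
        'R executable': shutil.which('R'),   # None if R is not on PATH
        'R_HOME': os.environ.get('R_HOME'),  # None if the variable is unset
        'R_USER': os.environ.get('R_USER'),
    }


for name, value in check_r_setup().items():
    print(name, '->', value if value else 'not found')
```

If any entry prints "not found" inside Jupyter but is set in your terminal, the notebook was probably launched from an environment that does not see those settings.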

Example Python to R pipeline

First, make some example data in Python.

In [1]:
import pandas as pd
df = pd.DataFrame({'Letter': ['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],
                   'X': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'Y': [0, 4, 3, 6, 7, 10, 11, 9, 13],
                   'Z': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

Load extension allowing one to run R code from within a Python notebook.

In [2]:
%load_ext rpy2.ipython

Run R code with cell (%%R) or line (%R) magics. "-i" passes a Python variable into R; "-o" sends an R variable back to Python.

In [36]:
%%R 
install.packages("ggplot2", dep=TRUE)
install.packages("tidyr", dep=TRUE)
install.packages("dplyr", dep=TRUE)

In [3]:
%%R -i df
library("ggplot2")
ggplot(data = df) + geom_point(aes(x = X, y = Y, color = Letter, size = Z))

Run MATLAB code

MATLAB for Jupyter installation

pip install matlab_kernel
pip install pymatbridge

If you're getting a "zmq channel closed" error, launch Jupyter Notebook on a different port when using MATLAB:

jupyter notebook --port=8889

Example Python to MATLAB pipeline

Load MATLAB extension for running MATLAB code within a Python notebook.

In [45]:
%load_ext pymatbridge
Starting MATLAB on ZMQ socket tcp://127.0.0.1:56960
Send 'exit' command to kill the server
..............................MATLAB started and connected!

Let's try transposing an array from Python in MATLAB, then feeding it back into Python.

First, define an array.

In [47]:
a = [
    [1, 2],
    [3, 4],
    [5, 6]
]
a
Out[47]:
[[1, 2], [3, 4], [5, 6]]

Now transpose it easily in MATLAB!

In [48]:
%%matlab -i a -o a
a = a'

a =

     1     3     5
     2     4     6

Finally, check that Python has the correct value of a.

In [49]:
a
Out[49]:
array([[ 1.,  3.,  5.],
       [ 2.,  4.,  6.]])
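
For comparison, and as a quick way to verify the round trip, the same transpose can be computed in plain Python without MATLAB. This sketch uses zip(*a) on the original nested list; note that it yields nested lists of integers, whereas the value returned from MATLAB above is a NumPy array of floats.

```python
a = [
    [1, 2],
    [3, 4],
    [5, 6]
]

# zip(*a) groups the first elements of every row, then the second elements
a_transposed = [list(column) for column in zip(*a)]
print(a_transposed)  # [[1, 3, 5], [2, 4, 6]]
```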

Here's an example of a MATLAB plot.

In [39]:
%%matlab
b = linspace(0.01,6*pi,100);
plot(sin(b))
grid on
hold on
plot(cos(b),'r')

Exit MATLAB when done.

In [50]:
%unload_ext pymatbridge
MATLAB closed

Run JavaScript code

Note that JavaScript executes as the notebook is opened, even if it's been exported as HTML!

In [41]:
%%javascript
console.log('hey!')